Learning Common Grammar from Multilingual Corpus
Authors
Abstract
We propose a corpus-based probabilistic framework to extract hidden common syntax across languages from non-parallel multilingual corpora in an unsupervised fashion. For this purpose, we assume a generative model for multilingual corpora, where each sentence is generated from a language-dependent probabilistic context-free grammar (PCFG), and these PCFGs are generated from a prior grammar that is common across languages. We also develop a variational method for efficient inference. Experiments on a non-parallel multilingual corpus of eleven languages demonstrate the feasibility of the proposed method.
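As a rough illustration of this generative story, the sketch below draws language-specific PCFG rule probabilities from a single shared Dirichlet prior and then samples a derivation per language. The grammar skeleton, symbol names, language codes, and hyperparameters are invented for the example; this is not the paper's actual model or its variational inference procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative grammar skeleton in Chomsky normal form; ("w",) marks a
# terminal emission. Symbols and rules are hypothetical.
RULES = {
    "S":  [("NP", "VP")],
    "NP": [("NP", "NP"), ("w",)],
    "VP": [("VP", "NP"), ("w",)],
}
LANGUAGES = ["en", "de", "ja"]

# Common prior grammar: one Dirichlet pseudo-count vector per nonterminal,
# shared by every language.
common_prior = {lhs: np.full(len(rhss), 2.0) for lhs, rhss in RULES.items()}

# Language-dependent PCFGs: rule probabilities drawn from the shared prior,
# so each language can deviate while staying close to the common grammar.
pcfgs = {
    lang: {lhs: rng.dirichlet(alpha) for lhs, alpha in common_prior.items()}
    for lang in LANGUAGES
}

def sample_tree(lang, symbol="S", depth=0, max_depth=5):
    """Sample a bracketed derivation from one language's PCFG."""
    if depth >= max_depth:          # cut off deep recursion with an emission
        return "w"
    probs = pcfgs[lang][symbol]
    rhs = RULES[symbol][rng.choice(len(probs), p=probs)]
    if rhs == ("w",):
        return "w"
    children = " ".join(sample_tree(lang, s, depth + 1, max_depth) for s in rhs)
    return f"({symbol} {children})"

for lang in LANGUAGES:
    print(lang, sample_tree(lang))
```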
Similar resources
Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction
We present a family of priors over probabilistic grammar weights, called the shared logistic normal distribution. This family extends the partitioned logistic normal distribution, enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar, providing a new way to encode prior knowledge about an unknown grammar. We describe a variational EM ...
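The snippet below only illustrates the basic logistic normal construction that this family of priors builds on: a correlated Gaussian vector is mapped onto the probability simplex with a softmax, so covariance between dimensions induces covariance between rule probabilities, which a Dirichlet prior cannot express. The dimension, mean, and covariance values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Logistic normal draw over K rule probabilities: sample a correlated
# Gaussian vector, then softmax it onto the probability simplex.
# Dimension, mean, and covariance are illustrative values only.
K = 4
mu = np.zeros(K)
Sigma = 0.5 * np.eye(K) + 0.3 * np.ones((K, K))  # shared off-diagonal mass

eta = rng.multivariate_normal(mu, Sigma)
theta = np.exp(eta) / np.exp(eta).sum()          # probabilities summing to 1
print(theta, theta.sum())
```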
Multilingual Document Classification via Transductive Learning
We present a transductive learning based framework for multilingual document classification, originally proposed in [7]. A key aspect in our approach is the use of a large-scale multilingual knowledge base, BabelNet, to support the modeling of different language-written documents into a common conceptual space, without requiring any language translation process. Results on real-world multilingu...
Controlled Natural Language Generation from a Multilingual FrameNet-Based Grammar
This paper presents a currently bilingual but potentially multilingual FrameNet-based grammar library implemented in Grammatical Framework. The contribution of this paper is two-fold. First, it offers a methodological approach to automatically generate the grammar based on semantico-syntactic valence patterns extracted from FrameNet-annotated corpora. Second, it provides a proof of concept for t...
Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning
Recently there has been a lot of interest in learning common representations for multiple views of data. Typically, such common representations are learned using a parallel corpus between the two views (say, 1M images and their English captions). In this work, we address a real-world scenario where no direct parallel data is available between two views of interest (say, V1 and V2) but parallel ...
Principled Multilingual Grammars for Large Corpora
We describe a multilingual implementation of such a grammar, and its advantages over both principle-based parsing and ad-hoc grammar design. We show how X-bar theory and language-independent semantic constraints facilitate grammar development. Our implementation includes innovative handling of (1) syntactic gaps, (2) logical structure alternations, and (3) conjunctions. Each of these innovations...